%load_ext pretty_jupyter
Shape Technical Report
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
import pandas as pd
from scipy.stats import chi2_contingency
from collections import Counter
from utils import *
import pywt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
import shap
import io
Introduction
The development of predictive models for the maintenance of FPSO vessel equipment represents a pivotal advancement in the maritime and oil & gas industries, promising to revolutionize how these sectors approach maintenance management and operational efficiency. By leveraging historical data and advanced analytics, predictive models enable stakeholders to anticipate equipment failures before they occur, facilitating timely maintenance actions that can significantly reduce downtime and associated costs. This proactive approach not only enhances the reliability and safety of FPSO operations but also optimizes the lifecycle of critical equipment, ensuring that maintenance resources are allocated more effectively. In an industry where operational interruptions can lead to substantial financial losses and environmental risks, the ability to predict and mitigate potential failures through advanced modeling techniques is invaluable, driving improvements in operational resilience, cost-efficiency, and environmental stewardship.
To keep an FPSO operating, sensors are used to monitor the equipment so that failures can be caught early. These sensors measure different parameters of the equipment under different setup configurations (Preset 1 and Preset 2) over time.
Objectives
The primary goal of this study is to conduct a comprehensive analysis of failures in FPSO vessels to derive actionable insights that will inform the optimal timing of maintenance interventions.
To achieve this overarching goal, we have identified the following specific objectives:
- Assess Data Quality: Examine the database to evaluate the integrity and quality of the data on FPSO vessel failures.
- Quantify Equipment Failures: Calculate the frequency of equipment failures to understand their prevalence and impact.
- Analyze Failure Correlations: Investigate the relationship between preset conditions and the occurrences of failures, identifying any significant patterns.
- Determine Parameter Influence: Analyze how each parameter contributes to equipment failures, highlighting critical factors.
- Develop a Predictive Model: Construct a model capable of predicting equipment failures based on analyzed variables, enhancing preemptive maintenance strategies.
- Evaluate Variable Significance: Assess the importance of each variable within the predictive model to understand their impact on failure prediction accuracy.
filename = "./O_G_Equipment_Data.xlsx"
data = pd.read_excel(filename)
# Capture the text output of DataFrame.info() so it can be shown as a table
buf = io.StringIO()
data.info(buf=buf)
s = buf.getvalue()
# Keep only the per-column lines (drop info()'s header and footer lines)
lines = [line.split() for line in s.splitlines()[3:-2]]
info_df = pd.DataFrame(lines)
Methodology
Dataset
We have data from a sensor installed in equipment onboard an FPSO vessel. The sensor measures the Temperature, Pressure, Vibration in the X, Y and Z directions, and the Frequency of a phenomenon important to the equipment's operation, as shown in the first rows of the dataset in Table 1 below.
| | Cycle | Preset_1 | Preset_2 | Temperature | Pressure | VibrationX | VibrationY | VibrationZ | Frequency | Fail |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 6 | 44.235186 | 47.657254 | 46.441769 | 64.820327 | 66.454520 | 44.483250 | False |
| 1 | 2 | 2 | 4 | 60.807234 | 63.172076 | 62.005951 | 80.714431 | 81.246405 | 60.228715 | False |
| 2 | 3 | 2 | 1 | 79.027536 | 83.032190 | 82.642110 | 98.254386 | 98.785196 | 80.993479 | False |
| 3 | 4 | 2 | 3 | 79.716242 | 100.508634 | 122.362321 | 121.363429 | 118.652538 | 80.315567 | False |
| 4 | 5 | 2 | 5 | 39.989054 | 51.764833 | 42.514302 | 61.037910 | 50.716469 | 64.245166 | False |
The dataset consists of 10 columns with 800 non-null measurements, structured as follows:
Cycle: Represents the operation cycle number of the equipment. This column serves as a temporal or sequential index, indicating the order in which the operation cycles were recorded.
Preset_1 and Preset_2: These columns are predefined settings or parameters applied to the equipment before or during each operation cycle. The values in these columns might represent different modes of operation, configuration adjustments, or other parameters relevant to the equipment's functioning.
Temperature: Records the temperature of the equipment or the environment in which the equipment is operating during the cycle. Temperature can be a critical factor affecting the equipment's performance and safety.
Pressure: Similarly, this column indicates the pressure measured during the operation cycle. Pressure is another vital parameter that can influence the efficacy and safety of equipment operations.
VibrationX, VibrationY, VibrationZ: These columns measure the equipment's vibrations in three orthogonal directions (X, Y, Z). Excessive vibrations can be indicative of impending equipment problems or failures, making these measurements important for predictive maintenance and failure prevention.
Frequency: Related to the frequency of the vibrations or to some other periodic operational measure of the equipment.
Fail: A boolean column (True/False) indicating whether a failure occurred in the equipment during the cycle. This column is essential for reliability analyses and developing predictive models of failures.
The description and information of the data are presented in Tables 2 and 3.
Table 2: Data Description
| | Cycle | Preset_1 | Preset_2 | Temperature | Pressure | VibrationX | VibrationY | VibrationZ | Frequency |
|---|---|---|---|---|---|---|---|---|---|
| count | 800.0000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 |
| mean | 400.5000 | 1.988750 | 4.551250 | 69.263494 | 78.997945 | 73.860275 | 72.786878 | 71.866211 | 68.223449 |
| std | 231.0844 | 0.805875 | 2.293239 | 25.536252 | 32.501834 | 31.229631 | 32.739745 | 27.844616 | 29.138702 |
| min | 1.0000 | 1.000000 | 1.000000 | 2.089354 | 3.480279 | 3.846343 | 10.057744 | 18.784169 | 4.380101 |
| 25% | 200.7500 | 1.000000 | 3.000000 | 51.040134 | 55.508564 | 50.752461 | 48.523982 | 50.787638 | 45.861762 |
| 50% | 400.5000 | 2.000000 | 5.000000 | 65.906716 | 75.014848 | 69.394953 | 65.504770 | 69.319237 | 65.664252 |
| 75% | 600.2500 | 3.000000 | 7.000000 | 80.527220 | 99.302530 | 90.195059 | 94.075572 | 88.891205 | 90.097457 |
| max | 800.0000 | 3.000000 | 8.000000 | 255.607829 | 189.995681 | 230.861142 | 193.569947 | 230.951134 | 178.090303 |
Table 3: Information
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | Cycle | 800 non-null | int64 |
| 1 | Preset_1 | 800 non-null | int64 |
| 2 | Preset_2 | 800 non-null | int64 |
| 3 | Temperature | 800 non-null | float64 |
| 4 | Pressure | 800 non-null | float64 |
| 5 | VibrationX | 800 non-null | float64 |
| 6 | VibrationY | 800 non-null | float64 |
| 7 | VibrationZ | 800 non-null | float64 |
| 8 | Frequency | 800 non-null | float64 |
| 9 | Fail | 800 non-null | bool |
The exploratory analysis will be carried out to gain insights about the dataset and to assess potential dependencies present in the data. This will be done mainly through various techniques and visualizations, to understand the data distribution, its characteristics, and the relationships between the parameters.
The last part will be the modelling and prediction. For this work we will use a set of classification models, including logistic regression and decision trees, to predict whether or not the equipment has failed.
The models will be evaluated by accuracy and recall. Accuracy can be a good measurement for a balanced dataset, but it is misleading for unbalanced ones. Recall measures how well the model avoids false negatives: a higher recall means that, when in doubt, the model predicts that the equipment is failing, so we do not risk skipping maintenance on equipment that actually is failing. This way, we are more likely to correctly identify failures.
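As a toy illustration of why accuracy alone is misleading on unbalanced data (the labels and counts below are made up for the example, not taken from the dataset):

```python
import numpy as np

# Hypothetical toy labels: 9 normal cycles and 1 failure (1 = Fail)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
y_pred = np.zeros(10, dtype=int)   # a "model" that never predicts failure

accuracy = (y_pred == y_true).mean()            # 0.9 -- looks great...
tp = ((y_pred == 1) & (y_true == 1)).sum()      # true positives
fn = ((y_pred == 0) & (y_true == 1)).sum()      # false negatives
recall = tp / (tp + fn)                         # 0.0 -- every failure missed
print(accuracy, recall)
```

A high accuracy is therefore compatible with missing every single failure, which is exactly the outcome maintenance planning cannot afford.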
Exploratory analysis
How many times has the equipment failed?
fail_counts = data['Fail'].value_counts()
labels = ['False', 'True']
counts = [fail_counts[False], fail_counts[True]]
fig, ax = plt.subplots()
ax.bar(labels, counts, color=['green', 'red'])
ax.set_ylabel('Count')
ax.set_title('Comparison of Fail and Not Fail Equipment')
for i in range(len(counts)):
    ax.text(i, counts[i] + 1, str(counts[i]), ha='center')
ax.grid()
plt.savefig('./figures/fail_not_fail.png', dpi=300, bbox_inches='tight')
print(f"The equipment has failed {data['Fail'].value_counts().iloc[1]} times")
print(f"The failure rate is {data['Fail'].value_counts().iloc[1]/len(data)*100:.2f}%")
The equipment has failed 66 times The failure rate is 8.25%
There are $66$ fails against $734$ normal-behaviour cycles in the dataset, i.e. failures represent $8.25\%$ of the data. From here we can get an idea of how unbalanced the dataset is. This imbalance can be tricky later on when building the models, and some oversampling technique might be helpful.
Categorize failures by Presets
fails_per_preset_1 = data.groupby(['Preset_1', 'Fail']).size().unstack().sort_values(by=True, ascending=False)
fails_per_preset_2 = data.groupby(['Preset_2', 'Fail']).size().unstack().sort_values(by=True, ascending=False)
| Fail | False | True |
|---|---|---|
| Preset_1 | ||
| 1 | 237 | 27 |
| 2 | 260 | 21 |
| 3 | 237 | 18 |
| Fail | False | True |
|---|---|---|
| Preset_2 | ||
| 5 | 88 | 12 |
| 1 | 84 | 11 |
| 2 | 92 | 9 |
| 6 | 92 | 9 |
| 7 | 100 | 9 |
| 8 | 93 | 7 |
| 3 | 95 | 6 |
| 4 | 90 | 3 |
The two tables above present the number of fails for preset$_1$ (left) and preset$_2$ (right). Configuration $1$ had the most failures in preset$_1$, and configuration $5$ had the most failures in preset$_2$.
combination = data.groupby(['Preset_1','Preset_2', 'Fail']).size().unstack(fill_value=0)
combination['Percentage_fail'] = combination[True] / (combination[True] + combination[False]) * 100
combination['cumulative'] = combination[True].cumsum()
combination.reset_index(inplace=True)
combination_sorted = combination.sort_values(by=True, ascending=False)
| Fail | index | Preset_1 | Preset_2 | False | True | Percentage_fail | cumulative |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 2 | 33 | 5 | 13.157895 | 9 |
| 4 | 4 | 1 | 5 | 26 | 5 | 16.129032 | 18 |
| 0 | 0 | 1 | 1 | 30 | 4 | 11.764706 | 4 |
| 20 | 20 | 3 | 5 | 25 | 4 | 13.793103 | 59 |
| 6 | 6 | 1 | 7 | 34 | 4 | 10.526316 | 25 |
| 8 | 8 | 2 | 1 | 26 | 4 | 13.333333 | 31 |
| 15 | 15 | 2 | 8 | 33 | 4 | 10.810811 | 48 |
| 22 | 22 | 3 | 7 | 31 | 3 | 8.823529 | 65 |
| 21 | 21 | 3 | 6 | 27 | 3 | 10.000000 | 62 |
| 16 | 16 | 3 | 1 | 28 | 3 | 9.677419 | 51 |
| 13 | 13 | 2 | 6 | 34 | 3 | 8.108108 | 42 |
| 12 | 12 | 2 | 5 | 37 | 3 | 7.500000 | 39 |
| 5 | 5 | 1 | 6 | 31 | 3 | 8.823529 | 21 |
| 10 | 10 | 2 | 3 | 24 | 2 | 7.692308 | 35 |
| 14 | 14 | 2 | 7 | 35 | 2 | 5.405405 | 44 |
| 9 | 9 | 2 | 2 | 32 | 2 | 5.882353 | 33 |
| 7 | 7 | 1 | 8 | 22 | 2 | 8.333333 | 27 |
| 17 | 17 | 3 | 2 | 27 | 2 | 6.896552 | 53 |
| 18 | 18 | 3 | 3 | 30 | 2 | 6.250000 | 55 |
| 3 | 3 | 1 | 4 | 20 | 2 | 9.090909 | 13 |
| 2 | 2 | 1 | 3 | 41 | 2 | 4.651163 | 11 |
| 11 | 11 | 2 | 4 | 39 | 1 | 2.500000 | 36 |
| 23 | 23 | 3 | 8 | 38 | 1 | 2.564103 | 66 |
| 19 | 19 | 3 | 4 | 31 | 0 | 0.000000 | 55 |
By looking at the picture we can see that failures happen during the whole period, and the cumulative sum of failures increases over time. The cumulative curve is practically linear. Although the curve fits a line well, showing no clear relation between the presets and the failures, we can observe that around combinations 20, 11, 4, and 23 there is a small drop relative to the curve. This could indicate that those presets contribute to a reduction in the cumulative sum; in other words, those combinations may have lower counts of failures.
We can see from the table that the preset combination 3-4 had no failures. Also, the failure percentages for combinations 3-8 and 2-4 were the lowest.
Distribution of values
One way to evaluate the contribution of each variable to equipment failure is to look at the distribution of its intensity. The figure below shows the distribution for each of the variables.
parameters = ['Temperature', 'Pressure', 'VibrationX', 'VibrationY', 'VibrationZ', 'Frequency']
fig, ax = plt.subplots(3, 2, figsize=(15, 15))
axs = ax.T.flatten()
for i, param in enumerate(parameters):
    sns.histplot(data, x=param, kde=True, element='step', ax=axs[i])
    axs[i].set_xlabel(param)
    axs[i].set_ylabel('Count')
plt.savefig('./figures/histograms.png', dpi=300, bbox_inches='tight')
The figure shows the distribution of the variables' measurements. All the sensors, except VibrationX, present a shape closer to a bimodal distribution, which can be related to a subgroup of values. Splitting the data into failure and non-failure cycles can give us a better understanding.
fig, ax = plt.subplots(3, 2, figsize=(15, 15))
axs = ax.T.flatten()
for i, param in enumerate(parameters):
    g = sns.histplot(data, x=param, hue='Fail', kde=True, element='step',
                     legend=True, stat='density', ax=axs[i], common_norm=False)
    axs[i].set_xlabel(param)
    axs[i].set_ylabel('Density')
    # Relabel the legend entries as False/True
    leg = g.axes.get_legend()
    leg.set_title('Fail')
    for t, l in zip(leg.texts, ['False', 'True']):
        t.set_text(l)
plt.savefig('./figures/histograms_separated.png', dpi=300, bbox_inches='tight')
The distributions in the figure above show a clear separation between the failure and non-failure measurements: the measurements have higher values when the failures happen. VibrationZ, VibrationY, and Pressure are also the ones with a tail of larger values.
Time series analysis
Here we investigate whether there is a preferential cycle or moment at which the failures occur. This offers another view of what happens to the intensity of the parameters when a failure occurs. A time-frequency analysis will also be applied, using the Continuous Wavelet Transform, in an attempt to identify changes in frequency during the cycles.
falhas = data[data['Fail'] == True]['Cycle']
fig, axs = plt.subplots(5, 1, figsize=(18, 20), sharex=True, dpi=200)
axs[0].plot(data['Cycle'], data['VibrationX'], label='VibrationX', color='blue')
axs[1].plot(data['Cycle'], data['VibrationY'], label='VibrationY', color='red')
axs[2].plot(data['Cycle'], data['VibrationZ'], label='VibrationZ', color='green')
axs[3].plot(data['Cycle'], data['Temperature'], label='Temperature', color='black')
axs[4].plot(data['Cycle'], data['Pressure'], label='Pressure', color='gray')
for ax in axs:
    for ciclo in falhas:
        ax.axvline(x=ciclo, color='k', linestyle='--', linewidth=1)
axs[0].set_ylabel('VibrationX')
axs[1].set_ylabel('VibrationY')
axs[2].set_ylabel('VibrationZ')
axs[3].set_ylabel('Temperature')
axs[4].set_ylabel('Pressure')
axs[4].set_xlabel('Cycle')
axs[0].set_title('VibrationX')
axs[1].set_title('VibrationY')
axs[2].set_title('VibrationZ')
axs[3].set_title('Temperature')
axs[4].set_title('Pressure')
plt.savefig('./figures/Vibration_Analysis.png', dpi=200, bbox_inches='tight')
From the figure above, we can see that failures occur across all the cycles, but there are some spots with a higher density of failures, notably after cycle 400. It is hard to draw conclusions just by looking at this graph. Qualitatively, it is possible to observe in all parameters that there is a frequency change when a failure occurs; it is clearest in the Temperature measurements. In addition, the vibration in the X direction has a peak right when the batch of failures starts, after cycle 400: its magnitude increases until the failure happens. Similar behaviour is present in the Y and Z directions.
Now, applying the CWT to the parameters we will be able to see the changes in frequency during the cycles.
# Select the components for the analysis
vibration_x = data['VibrationX'].values
vibration_y = data['VibrationY'].values
vibration_z = data['VibrationZ'].values
pressure = data['Pressure'].values
temperature = data['Temperature'].values
scales = np.arange(1, 128)
coef_x, freqs_x = pywt.cwt(vibration_x, scales, 'cmor1.5-0.5')
coef_y, freqs_y = pywt.cwt(vibration_y, scales, 'cmor1.5-0.5')
coef_z, freqs_z = pywt.cwt(vibration_z, scales, 'cmor1.5-0.5')
coef_pres, freqs_pres = pywt.cwt(pressure, scales, 'cmor1.5-0.5')
coef_temp, freqs_temp = pywt.cwt(temperature, scales, 'cmor1.5-0.5')
fig, axs = plt.subplots(5, 1, figsize=(16, 14), sharex=True)
cax1 = axs[0].pcolormesh(np.arange(len(vibration_x)), freqs_x, np.abs(coef_x), cmap='plasma')
fig.colorbar(cax1, ax=axs[0])
axs[0].set_title('VibrationX')
axs[0].set_ylabel('Frequency')
axs[0].set_yscale('log', base=2)
cax2 = axs[1].pcolormesh(np.arange(len(vibration_y)), freqs_y, np.abs(coef_y), cmap='plasma')
fig.colorbar(cax2, ax=axs[1])
axs[1].set_title('VibrationY')
axs[1].set_ylabel('Frequency')
axs[1].set_yscale('log', base=2)
cax3 = axs[2].pcolormesh(np.arange(len(vibration_z)), freqs_z, np.abs(coef_z), cmap='plasma')
fig.colorbar(cax3, ax=axs[2])
axs[2].set_title('VibrationZ')
axs[2].set_ylabel('Frequency')
axs[2].set_yscale('log', base=2)
cax4 = axs[3].pcolormesh(np.arange(len(pressure)), freqs_pres, np.abs(coef_pres), cmap='plasma')
fig.colorbar(cax4, ax=axs[3])
axs[3].set_title('Pressure')
axs[3].set_ylabel('Frequency')
axs[3].set_yscale('log', base=2)
cax5 = axs[4].pcolormesh(np.arange(len(temperature)), freqs_temp, np.abs(coef_temp), cmap='plasma')
fig.colorbar(cax5, ax=axs[4])
axs[4].set_title('Temperature')
axs[4].set_ylabel('Frequency')
axs[4].set_yscale('log', base=2)
axs[4].set_xlabel('Cycle')
rect_width = 1
for ax in axs:
    # Rectangle height relative to each axis' own y-range
    rect_height = (ax.get_ylim()[1] - ax.get_ylim()[0]) * 0.001
    for ciclo in falhas:
        rect = patches.Rectangle((ciclo - rect_width / 2, ax.get_ylim()[0]), rect_width, rect_height,
                                 linewidth=0, edgecolor=None, facecolor='white', zorder=2)
        ax.add_patch(rect)
plt.tight_layout()
plt.savefig('./figures/wavelet_Analysis.png', dpi=200, bbox_inches='tight')
We can see from the figure above that the signal presents some continuous frequencies in the lower range of scales throughout the whole period. Moreover, around cycles 400 and 500 we observe some spikes in frequency, with higher spectral power in the lower frequency ranges but also extending up to higher frequencies. This could indicate a change of phase in the signal, and some noise may also be affecting it. The white rectangles at the bottom represent the cycles in which there was a failure; the failures happened when there was larger power at higher frequencies.
It is difficult to draw conclusions and build a predictor from this analysis alone, but the power spectral density could be a good candidate for a new feature when creating the models.
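A minimal sketch of how such a feature might be built, assuming a trailing-window FFT per cycle; the window length of 32 cycles and the "upper half of the spectrum" band are arbitrary choices made for illustration, not values used in the report:

```python
import numpy as np

def rolling_band_power(signal, window=32):
    """For each cycle, power in the upper half of the FFT spectrum of the
    trailing `window` samples -- a rough stand-in for 'high-frequency power'.
    The first `window - 1` cycles have no full window and stay NaN."""
    out = np.full(len(signal), np.nan)
    for t in range(window - 1, len(signal)):
        seg = signal[t - window + 1:t + 1]
        spec = np.abs(np.fft.rfft(seg - seg.mean())) ** 2  # power spectrum
        out[t] = spec[len(spec) // 2:].sum()               # upper half only
    return out

rng = np.random.default_rng(0)
sig = np.sin(0.1 * np.arange(800)) + 0.1 * rng.normal(size=800)
feat = rolling_band_power(sig)
print(feat.shape)  # (800,)
```

The resulting column aligns one-to-one with the cycles, so it could be appended to the feature matrix alongside the raw sensor readings.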
Modelling
For training these models we’ll be doing the following steps:
- Split the data into training and test sets. (80/20 proportion)
- Apply the oversampling method to the training data only.
- Train the model using the balanced training set.
- Evaluate the model’s performance on the unchanged test set.
Resampling the data
To overcome the unbalanced data, two methods will be used: the Synthetic Minority Oversampling Technique (SMOTE) and the Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN).
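The oversampling itself is done through a helper from `utils`; as a rough sketch of the idea behind SMOTE (not the report's actual implementation), each synthetic point is a random interpolation between a minority sample and one of its nearest minority-class neighbours:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=None):
    """Minimal SMOTE-style oversampling sketch: each synthetic point is a
    random interpolation between a minority sample and one of its k nearest
    minority-class neighbours (Euclidean distance, brute force)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neigh = np.argsort(d)[1:k + 1]        # nearest neighbours, excluding i
        j = rng.choice(neigh)
        lam = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 3))              # 10 minority samples, 3 features
X_new = smote_sketch(X_min, n_new=20, seed=0)
print(X_new.shape)  # (20, 3)
```

In practice the `imbalanced-learn` implementations of SMOTE and ADASYN handle the neighbour search and the pairing with the majority class; this sketch only shows why the synthetic points stay inside the minority region rather than being exact duplicates.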
But first, let's build a model with no oversampling, using the logistic regression model in this first scenario.
data['Fail'] = data['Fail'].astype(int)
X = data.drop(columns=['Cycle', 'Preset_1', 'Preset_2', 'Fail'])
y = data['Fail']
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size=0.20, random_state=0)
scaler = StandardScaler()
# Fit the scaler on the training set only; reuse it to transform the test
# set, so no information from the test split leaks into the scaling
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
cm = confusion_matrix(y_test, predictions)
TN, FP, FN, TP = confusion_matrix(y_test, predictions).ravel()
print('True Positive(TP) = ', TP)
print('False Positive(FP) = ', FP)
print('True Negative(TN) = ', TN)
print('False Negative(FN) = ', FN)
True Positive(TP) = 6 False Positive(FP) = 3 True Negative(TN) = 142 False Negative(FN) = 9
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
accuracy = (TP + TN) / (TP + FP + TN + FN)
print('Accuracy of the binary classifier = {:0.3f}'.format(accuracy))
Accuracy of the binary classifier = 0.925
recall = TP / (TP + FN)
print('Recall of the binary classifier = {:0.3f}'.format(recall))
Recall of the binary classifier = 0.400
Since the dataset is highly unbalanced, the model performs better in the direction of the majority class. We got an accuracy of 0.925, which can be seen as a high value, but looking at the confusion matrix, the model found only 6 of the 15 actual failures in the test set: 9 false negatives against 6 true positives, i.e. a recall of only 0.400.
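To put the 0.925 accuracy in context, a quick sanity check from the confusion-matrix counts reported above: a trivial model that always predicts "no failure" would already score about 0.906 on this test split.

```python
# Actual class counts in the test split, from the confusion matrix above
n_neg = 142 + 3   # TN + FP: true non-failures
n_pos = 9 + 6     # FN + TP: true failures

# Accuracy of a trivial baseline that always predicts "no failure"
baseline_accuracy = n_neg / (n_neg + n_pos)
print(f"{baseline_accuracy:.3f}")  # 0.906
```

The logistic regression therefore beats the do-nothing baseline only marginally on accuracy, which is why recall is the more informative metric here.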
Now let's use the SMOTE method to balance the dataset, and perform the training and prediction of a logistic regression model.
newDF = oversamplig_dataframe(data, 'Fail', method='SMOTE')
X = newDF.drop(columns=['Cycle', 'Preset_1', 'Preset_2', 'Fail'])
y = newDF['Fail']
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size=0.20, random_state=0)
scaler = StandardScaler()
# Fit the scaler on the training set only; reuse it for the test set
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
cm = confusion_matrix(y_test, predictions)
TN, FP, FN, TP = confusion_matrix(y_test, predictions).ravel()
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
accuracy = (TP + TN) / (TP + FP + TN + FN)
print('Accuracy of the binary classifier = {:0.3f}'.format(accuracy))
recall = TP / (TP + FN)
print('Recall of the binary classifier = {:0.3f}'.format(recall))
Accuracy of the binary classifier = 0.959 Recall of the binary classifier = 1.000
Balancing the dataset with additional synthetic data gives a better prediction in this case: the accuracy and recall show very high values. But the recall of 1.000 suggests data leakage rather than genuine skill; because the oversampling was applied before the train/test split, synthetic points derived from training samples end up in the test set, inflating the scores.
Let's apply the oversampling only to the training set now.
X = data.drop(columns=['Cycle', 'Preset_1', 'Preset_2', 'Fail'])
y = data['Fail']
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size=0.20, random_state=0)
scaler = StandardScaler()
# Fit the scaler on the training set only; reuse it for the test set
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train, y_train = oversamplig_data(X_train, y_train, method='SMOTE')
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
cm = confusion_matrix(y_test, predictions)
TN, FP, FN, TP = confusion_matrix(y_test, predictions).ravel()
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
accuracy = (TP + TN) / (TP + FP + TN + FN)
print('Accuracy of the binary classifier = {:0.3f}'.format(accuracy))
recall = TP / (TP + FN)
print('Recall of the binary classifier = {:0.3f}'.format(recall))
Accuracy of the binary classifier = 0.919 Recall of the binary classifier = 0.667
It performed somewhat better than the first model (recall of 0.667 against 0.400). Let's keep this setup and evaluate a set of different models.
models = {}
# Logistic Regression
models['Logistic Regression'] = LogisticRegression()
# Support Vector Machines
models['Support Vector Machines'] = LinearSVC()
# Decision Trees
models['Decision Trees'] = DecisionTreeClassifier()
# Random Forest
models['Random Forest'] = RandomForestClassifier()
# Naive Bayes
models['Naive Bayes'] = GaussianNB()
# K-Nearest Neighbors
models['K-Nearest Neighbor'] = KNeighborsClassifier()
accuracy, precision, recall = {}, {}, {}
for key in models.keys():
    # Fit the classifier
    models[key].fit(X_train, y_train)
    # Make predictions
    predictions = models[key].predict(X_test)
    # Calculate metrics (y_true first, y_pred second)
    accuracy[key] = accuracy_score(y_test, predictions)
    precision[key] = precision_score(y_test, predictions)
    recall[key] = recall_score(y_test, predictions)
df_model = pd.DataFrame(index=models.keys(), columns=['Accuracy', 'Precision', 'Recall'])
df_model['Accuracy'] = accuracy.values()
df_model['Precision'] = precision.values()
df_model['Recall'] = recall.values()
Analysing the values in the table above, we can see that Random Forest and K-Nearest Neighbor were the models that performed best when considering recall. In this case I consider recall the more important metric, because it tells us that the model is not producing as many false negatives.
| Accuracy | Precision | Recall | |
|---|---|---|---|
| Logistic Regression | 0.91875 | 0.666667 | 0.555556 |
| Support Vector Machines | 0.91250 | 0.666667 | 0.526316 |
| Decision Trees | 0.92500 | 0.333333 | 0.714286 |
| Random Forest | 0.94375 | 0.466667 | 0.875000 |
| Naive Bayes | 0.92500 | 0.933333 | 0.560000 |
| K-Nearest Neighbor | 0.93750 | 0.733333 | 0.647059 |
ax = df_model.plot.barh()
ax.legend(
ncol=len(models.keys()),
bbox_to_anchor=(0, 1),
loc='lower left',
prop={'size': 14}
)
plt.tight_layout()
plt.savefig('./figures/model_comparison.png', dpi=300, bbox_inches='tight')
Variable importance (Explainability)
One way to evaluate the feature contribution is by using the SHAP (SHapley Additive exPlanations) values. They are a popular and powerful method for explaining the output of machine learning models. They are based on Shapley values, a concept from cooperative game theory that distributes "payouts" (in this case, predictions) among players (features) based on their contribution to the total payout.
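As a toy illustration of the game-theoretic idea (not SHAP's actual algorithm, which relies on efficient approximations), the exact Shapley value averages a player's marginal contribution over all coalitions; for a purely additive game, each player simply recovers its own weight:

```python
from itertools import combinations
from math import factorial

def exact_shapley(value, n):
    """Exact Shapley values for an n-player cooperative game.
    `value` maps a frozenset of player indices to the coalition's payout."""
    phi = [0.0] * n
    players = range(n)
    for i in players:
        for r in range(n):
            for S in combinations([p for p in players if p != i], r):
                S = frozenset(S)
                # Shapley weight of this coalition size
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (value(S | {i}) - value(S))
    return phi

# Toy additive game: a coalition's payout is the sum of its members' weights,
# so each player's Shapley value is exactly its own weight.
weights = [3.0, 1.0, 2.0]
v = lambda S: sum(weights[i] for i in S)
print(exact_shapley(v, 3))
```

SHAP applies this same attribution principle with features as "players" and the model prediction as the "payout", which is why the per-feature contributions in the summary plot sum to the difference between the prediction and the baseline.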
Let's see the importance of each variable to the Random Forest model
X_test_df = pd.DataFrame(X_test, columns=X.columns)
explainer = shap.Explainer(models['Random Forest'].predict, X_test_df)
shap_values = explainer(X_test_df)
shap.summary_plot(shap_values, plot_type='violin', show=False)
plt.savefig('shap_summary_plot.png', dpi=300, bbox_inches='tight')
The figure shows how much each variable contributes to the output of the model. The colormap shows the intensity of the measurement, and the x-axis shows how strongly each value pushes the model output up or down.
The behaviour is quite similar for all the variables, but the vibration in the Y direction is the one with the largest spread in values, indicating that it contributes the most to changes in the model output.
This is not a final analysis, but it gives us a clue about how our model is working.
Conclusion
In this study, we began by making initial assumptions about the relationships between our data points to ensure the reliability of our findings. We carefully examined the data to identify any patterns or connections that might affect our results.
Through a thorough exploration of the data, which involved various methods and the creation of charts, we discovered several key insights. Notably, there was a distinct correlation between higher readings of certain variables and an increased number of equipment failures on FPSO vessels. This discovery is vital as it aids in understanding potential causes of these failures.
Additionally, we observed that using a specific setting (Preset configuration 3-4) resulted in fewer equipment failures. This insight could be valuable for managing and mitigating failure risks.
Subsequently, we trained logistic regression models under three setups (no resampling, resampling before the split, and resampling of the training set only) to determine which factors are most likely to contribute to equipment failures. This analysis helped us pinpoint critical factors to monitor in order to prevent failures.
Throughout our study, we operated under the assumption that each data point was independent, unaffected by preceding or succeeding data. While this assumption was crucial for our analysis, it is important to recognize it as a potential limitation.
Lastly, we employed classification models to enhance our understanding further. These models elucidated the relationships between various factors and their ability to predict future failures.
In conclusion, our research offers valuable insights into the causes of equipment failures on FPSO vessels and proposes methods for predicting and preventing such failures. This knowledge is instrumental in improving maintenance practices, aiming to enhance operational reliability and efficiency while reducing downtime.
Next Steps
For future steps, I would recommend developing more sophisticated models, such as Long Short-Term Memory (LSTM) or Convolutional Neural Networks (CNN) for time series prediction. By analyzing a batch of data preceding a failure, these models could potentially predict failures several cycles in advance.
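A minimal sketch of the data preparation such sequence models would need, assuming a hypothetical `lookback` of 10 cycles: each training sample is the window of sensor readings preceding a cycle, labelled with that cycle's Fail flag.

```python
import numpy as np

def make_windows(X, y, lookback=10):
    """Turn per-cycle features into (samples, lookback, features) sequences,
    each labelled with the Fail flag of the cycle that follows the window --
    the shape an LSTM/CNN sequence model would consume."""
    Xs, ys = [], []
    for t in range(lookback, len(X)):
        Xs.append(X[t - lookback:t])
        ys.append(y[t])
    return np.array(Xs), np.array(ys)

# Dummy stand-ins for the 800 cycles and 6 sensor channels of the dataset
X = np.random.default_rng(0).normal(size=(800, 6))
y = np.zeros(800, dtype=int)
X_seq, y_seq = make_windows(X, y, lookback=10)
print(X_seq.shape)  # (790, 10, 6)
```

With sequences built this way, the split into training and test sets should be done along the cycle axis (not randomly) to respect the temporal ordering.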
A more detailed analysis concerning oversampling and data balancing should also be conducted. Additionally, further feature engineering to incorporate different features into the dataset could enhance our understanding of the problem and improve prediction accuracy.
Upon training the final model, it could be integrated into an application, or even into the sensor itself, enabling it to detect and alert operators to a high probability of failure in upcoming cycles.
Another avenue could involve developing a platform where clients can access the model via an API, connecting it with equipment measurements directly.